In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
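import matplotlib as plt   # note: plt is the top-level package here, so plots are created via plt.pyplot
import matplotlib.pyplot   # load the pyplot submodule explicitly so plt.pyplot is always available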
from IPython.display import display_html

In [2]:
df = pd.read_csv('data/train.csv')
df.head(10)   # print the first 10 rows to inspect the sample data


Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 0 237736 30.0708 NaN C

In [3]:
df.describe(percentiles=[.1, .25, .5, .75, .9])


Out[3]:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
10%      90.000000    0.000000    1.000000   14.000000    0.000000    0.000000    7.550000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
90%     802.000000    1.000000    3.000000   50.000000    1.000000    2.000000   77.958300
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

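What describe tells us:

  • 891 samples in total
  • Survived: the survival rate is 38%
  • Pclass: at least 50% of the passengers are in third class (the median is 3)
  • Age
    • some values are missing (only 714 of 891 are present)
    • the mean age is about 30
    • the minimum age of 0.42 looks suspicious, but it is probably not an error: the Kaggle data description states that ages below 1 are given as fractions
    • about 1/4 of the passengers are 20 or younger
  • SibSp: at least half of the passengers have no spouse/siblings aboard (the median is 0, the 75th percentile is 1)
  • Parch: the mean of 0.38 is the average number of parents/children per passenger, not a share of passengers; the 75th percentile is 0, so at most a quarter travel with parents/children
  • Fare
    • the minimum is 0
    • 3/4 of the passengers paid 31 or less
    • the gap between the 90th percentile (about 78) and the max (about 512) is large, so the max may be an outlier or an error and is worth checking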
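
The readings above are inferred from percentiles, so it can help to count the rows directly. The following is a minimal sketch, assuming the column names shown in Out[2]; the helper name `summarize_counts` and the tiny `sample` frame are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

def summarize_counts(frame):
    """Return direct counts and shares that describe() only hints at."""
    return {
        # number of rows with no recorded age
        'age_missing': int(frame['Age'].isnull().sum()),
        # ages below 1 (fractional ages, i.e. infants per the Kaggle data description)
        'age_below_1': int((frame['Age'] < 1).sum()),
        # share of passengers with at least one sibling/spouse aboard
        'share_with_sibsp': float((frame['SibSp'] > 0).mean()),
        # share of passengers with at least one parent/child aboard
        'share_with_parch': float((frame['Parch'] > 0).mean()),
    }

# Hypothetical toy data, NOT the Titanic file
sample = pd.DataFrame({
    'Age':   [22.0, np.nan, 0.42, 35.0],
    'SibSp': [1, 0, 0, 0],
    'Parch': [0, 0, 2, 0],
})
print(summarize_counts(sample))
```

On the real data this would be called as `summarize_counts(df)`.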

Distributions: histograms


In [4]:
fig = plt.pyplot.figure(figsize=(30, 4))
ax = fig.add_subplot(131)
ax.hist(df['Age'].dropna(), bins=10)   # drop missing ages first; the default range is already min..max
ax.set_xlabel('Age')
ax.set_ylabel('Age distribution')

ax = fig.add_subplot(132)
ax.hist(df['Fare'], bins=10, range=(df['Fare'].min(), df['Fare'].max()))
ax.set_xlabel('Fare')
ax.set_ylabel('Fare distribution')

ax = fig.add_subplot(133)
s = df['Fare']
ax.hist(s[s < s.max()], bins=10)
ax.set_xlabel('Fare without max')
ax.set_ylabel('Fare distribution')
plt.pyplot.show()
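
The third histogram only drops the rows at the maximum fare. One way to judge whether that maximum is a data error is to list the largest fares and count how many rows share the top value: several passengers with the same fare make a typo less likely. This is a sketch under that assumption; `top_fares` and the `sample` frame are hypothetical names and data, not part of the notebook.

```python
import pandas as pd

def top_fares(frame, n=5):
    """Return the n largest fares and the number of rows equal to the maximum."""
    fares = frame['Fare'].dropna()
    largest = fares.nlargest(n)                     # n highest fares, ties included
    at_max = int((fares == fares.max()).sum())      # rows that share the top fare
    return largest, at_max

# Hypothetical toy data, NOT the Titanic file
sample = pd.DataFrame({'Fare': [7.25, 71.28, 8.05, 512.33, 512.33, 30.0]})
largest, at_max = top_fares(sample, n=3)
print(largest.tolist(), at_max)
```

On the real data, `top_fares(df)` would show whether the 512.3292 fare is isolated or shared.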


Box plot


In [5]:
df.boxplot('Fare', by='Pclass', figsize=(20, 4))


Out[5]:
<matplotlib.axes.AxesSubplot at 0x7f2116d84a90>

In [6]:
grouped = df.groupby('Pclass')

fig = plt.pyplot.figure(figsize=(30, 4))
ax = fig.add_subplot(121)
ax.set_title('Pclass count')
ax.set_xlabel('Pclass')
ax.set_ylabel('Count')
grouped.Survived.count().plot(kind='bar', ax=ax)   # draw on this subplot explicitly

ax = fig.add_subplot(122)
ax.set_title('Pclass survived')
ax.set_xlabel('Pclass')
ax.set_ylabel('Survival rate')   # values are fractions between 0 and 1, not percentages
(grouped.Survived.sum() / grouped.Survived.count()).plot(kind='bar', ax=ax)


Out[6]:
<matplotlib.axes.AxesSubplot at 0x7f2116c98990>
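
The two bar charts can also be summarized as one small table. Because Survived is 0/1 with no missing values, its mean per group equals sum / count, so a single `agg` call gives both numbers. A minimal sketch; the helper `survival_by` and the `sample` frame are hypothetical and only illustrate the idea.

```python
import pandas as pd

def survival_by(frame, key):
    """Count passengers and compute the survival rate for each value of key."""
    return (frame.groupby(key)['Survived']
                 .agg(['count', 'mean'])
                 .rename(columns={'mean': 'survival_rate'}))

# Hypothetical toy data, NOT the Titanic file
sample = pd.DataFrame({
    'Pclass':   [1, 1, 2, 3, 3, 3],
    'Survived': [1, 0, 1, 0, 0, 1],
})
print(survival_by(sample, 'Pclass'))
```

On the real data this would be `survival_by(df, 'Pclass')`; the same helper works for other keys such as 'Sex'.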

In [7]:
df2 = pd.crosstab([df.Pclass, df.Sex], df.Survived.astype(bool))
display_html(df2)
df2.plot(kind='bar', stacked=True, color=['red', 'g'], figsize=(20, 5), fontsize=16)


Survived        False  True
Pclass Sex
1      female       3    91
       male        77    45
2      female       6    70
       male        91    17
3      female      72    72
       male       300    47
Out[7]:
<matplotlib.axes.AxesSubplot at 0x7f2116c245d0>
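
Raw counts make groups of different sizes hard to compare. From Out[7], for example, 91 of 94 first-class women survived (about 97%) against 47 of 347 third-class men (about 14%). `pd.crosstab` can produce these row-wise rates directly with `normalize='index'`. A sketch under the same column-name assumptions; `survival_rate_table`, the 'died'/'survived' labels, and the `sample` frame are hypothetical.

```python
import pandas as pd

def survival_rate_table(frame):
    """Cross-tabulate (Pclass, Sex) against the outcome, normalized per row."""
    outcome = frame['Survived'].map({0: 'died', 1: 'survived'})
    return pd.crosstab([frame['Pclass'], frame['Sex']], outcome,
                       normalize='index')   # each row sums to 1

# Hypothetical toy data, NOT the Titanic file
sample = pd.DataFrame({
    'Pclass':   [1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'male', 'female', 'female', 'male', 'male'],
    'Survived': [1, 0, 1, 0, 0, 0],
})
rates = survival_rate_table(sample)
print(rates)
```

On the real data, `survival_rate_table(df).plot(kind='bar', stacked=True)` would give a stacked chart in which every bar has height 1.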

In [8]:
from IPython.display import FileLink
FileLink('Titanic baby step for pandas Part 2.ipynb')